Readd clustering #281

SimDing · 2022-01-28T18:25:31Z

Concerning issue #116

This re-adds the clustering that was removed in #89, without any user interface.

Clustering

The clustering is implemented in the de.jplag.clustering package.
It includes two clustering algorithms (spectral and agglomerative), preprocessing, decoupling logic, clustering options and a factory class which can be used to run the clustering in just two statements.

Algorithms

Agglomerative Clustering

Uses a bottom-up approach to successively merge similar clusters. It stops once there are no clusters left that are similar enough to merge. An implementation of this algorithm was originally part of the code base but was removed in #89. It is included again because it is a much simpler approach than spectral clustering.
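As a rough illustration of the merge loop described above (not the AgglomerativeClustering class from this PR), here is a minimal sketch. It assumes average linkage and a plain double[][] similarity matrix; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; not the AgglomerativeClustering class from this PR. */
final class AgglomerativeSketch {

    /** Repeatedly merges the most similar pair of clusters while its similarity exceeds the threshold. */
    static List<List<Integer>> cluster(double[][] similarity, double threshold) {
        // Start with one singleton cluster per submission index.
        List<List<Integer>> clusters = new ArrayList<>();
        for (int i = 0; i < similarity.length; i++) {
            List<Integer> single = new ArrayList<>();
            single.add(i);
            clusters.add(single);
        }
        while (clusters.size() > 1) {
            int bestLeft = -1;
            int bestRight = -1;
            double bestSimilarity = threshold;
            for (int left = 0; left < clusters.size(); left++) {
                for (int right = left + 1; right < clusters.size(); right++) {
                    double current = averageSimilarity(similarity, clusters.get(left), clusters.get(right));
                    if (current > bestSimilarity) {
                        bestSimilarity = current;
                        bestLeft = left;
                        bestRight = right;
                    }
                }
            }
            if (bestLeft < 0) {
                break; // no pair is similar enough to merge
            }
            // bestRight > bestLeft, so removing it keeps bestLeft valid.
            List<Integer> merged = clusters.remove(bestRight);
            clusters.get(bestLeft).addAll(merged);
        }
        return clusters;
    }

    /** Average pairwise similarity between the members of two clusters (average linkage). */
    static double averageSimilarity(double[][] similarity, List<Integer> a, List<Integer> b) {
        double sum = 0;
        for (int i : a) {
            for (int j : b) {
                sum += similarity[i][j];
            }
        }
        return sum / (a.size() * b.size());
    }
}
```

The threshold plays the role of the stopping criterion: a higher value leaves more, smaller clusters.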

Spectral Clustering

Spectral clustering is a clustering approach specifically for graph data. This matches the problem, since the similarities between all submissions can be thought of as a fully connected graph.
Spectral clustering works by computing the Laplacian matrix of the graph and representing the nodes as k-dimensional vectors using its eigenvectors.
The resulting vectors can then be clustered using a space-partitioning algorithm; I used k-Means++.
However, both k-Means and the reduction to k dimensions require the final number of clusters k, which is not known in advance.
In addition, k-Means++ yields probabilistic results.
To find a good choice for k and a "good" clustering, I employ Bayesian optimization.
A metric I found in line with my notion of a "good" clustering is:

"The average of the clusters' modularity times their average inner-cluster similarity over the number of each cluster's connections."
By modularity I mean the measure introduced by M. Newman and M. Girvan in "Finding and evaluating community structure in networks" (2004).
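For reference, the modularity of Newman and Girvan for a partition into communities can be written as below, where e_ij is the fraction of all edges (or edge weight) that connects community i with community j. How exactly the metric above combines the per-cluster terms with the inner-cluster similarity is not spelled out here, so this only restates the cited definition.

```latex
Q = \sum_{i} \left( e_{ii} - a_i^{2} \right), \qquad a_i = \sum_{j} e_{ij}
```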

Spectral clustering is used by default.
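To make the Laplacian step more concrete, here is a dependency-free sketch. It is not the code from this PR: it assumes the unnormalized Laplacian L = D - W, and instead of k-Means++ on k eigenvectors it only handles the two-cluster special case by approximating the Fiedler vector (the eigenvector of the smallest non-zero eigenvalue) with power iteration and splitting by sign. All names are hypothetical.

```java
/** Hypothetical sketch of the spectral idea for two clusters; not the code from this PR. */
final class SpectralSketch {

    /** Unnormalized graph Laplacian L = D - W for a symmetric similarity matrix W. */
    static double[][] laplacian(double[][] w) {
        int n = w.length;
        double[][] l = new double[n][n];
        for (int i = 0; i < n; i++) {
            double degree = 0;
            for (int j = 0; j < n; j++) {
                if (i != j) {
                    degree += w[i][j];
                    l[i][j] = -w[i][j];
                }
            }
            l[i][i] = degree;
        }
        return l;
    }

    /** Approximates the Fiedler vector by power iteration on (shift * I - L), projecting out the constant vector. */
    static double[] fiedlerVector(double[][] l, int iterations) {
        int n = l.length;
        double shift = 0;
        for (int i = 0; i < n; i++) {
            shift = Math.max(shift, 2 * l[i][i]); // Gershgorin bound on the largest eigenvalue of L
        }
        double[] v = new double[n];
        for (int i = 0; i < n; i++) {
            v[i] = i; // arbitrary deterministic start vector
        }
        for (int it = 0; it < iterations; it++) {
            centerAndNormalize(v);
            double[] next = new double[n];
            for (int i = 0; i < n; i++) {
                next[i] = shift * v[i];
                for (int j = 0; j < n; j++) {
                    next[i] -= l[i][j] * v[j];
                }
            }
            v = next;
        }
        centerAndNormalize(v);
        return v;
    }

    /** Removes the component along the constant vector and scales to unit length. */
    private static void centerAndNormalize(double[] v) {
        double mean = 0;
        for (double x : v) {
            mean += x;
        }
        mean /= v.length;
        double norm = 0;
        for (int i = 0; i < v.length; i++) {
            v[i] -= mean;
            norm += v[i] * v[i];
        }
        norm = Math.sqrt(norm);
        if (norm == 0) {
            return;
        }
        for (int i = 0; i < v.length; i++) {
            v[i] /= norm;
        }
    }
}
```

Nodes whose entries in the returned vector share a sign end up in the same cluster. For more than two clusters, several eigenvectors and a partitioning algorithm such as k-Means++ are needed, as described above.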

Preprocessing

As it can be advantageous to apply some preprocessing before clustering (in particular with spectral clustering), I included three options for preprocessing.

CDF Preprocessor

Estimates the cumulative distribution function (CDF) of all similarities and multiplies each similarity by the CDF evaluated at that similarity. This has the effect of driving the lowest similarities close to zero while hardly changing the highest ones.
Since this preprocessor is non-parametric and worked well during my experiments, I made it the default.
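The effect can be sketched as follows. This is an illustration only: it assumes the empirical CDF (the fraction of similarities less than or equal to a value) as the estimator, which may differ from the estimator used in this PR, and the names are hypothetical.

```java
import java.util.Arrays;

/** Hypothetical sketch of CDF-based preprocessing; the estimator used in this PR may differ. */
final class CdfPreprocessorSketch {

    /** Multiplies each similarity by the empirical CDF evaluated at that similarity. */
    static double[] preprocess(double[] similarities) {
        double[] sorted = similarities.clone();
        Arrays.sort(sorted);
        double[] result = new double[similarities.length];
        for (int i = 0; i < similarities.length; i++) {
            result[i] = similarities[i] * empiricalCdf(sorted, similarities[i]);
        }
        return result;
    }

    /** Fraction of the sorted values that are less than or equal to x. */
    static double empiricalCdf(double[] sorted, double x) {
        int count = 0;
        while (count < sorted.length && sorted[count] <= x) {
            count++;
        }
        return (double) count / sorted.length;
    }
}
```

With the similarities 0.1, 0.2, 0.4 and 0.8, the lowest value is scaled by 0.25 while the highest is multiplied by 1.0 and stays unchanged.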

Threshold Preprocessor

Suppresses all similarities below a given threshold. Good values for the threshold vary greatly with the set of input submissions.

Percentile Preprocessor

Is the same as the threshold preprocessor, but the threshold is given as a percentile of the calculated similarities, making it more robust.
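Both variants can be sketched in a few lines. The sketch assumes that "suppressing" means setting a similarity to zero, that the percentile is given in the range [0, 1], and that the nearest-rank method is used to turn it into a threshold; none of this is confirmed by the PR text, and the names are hypothetical.

```java
import java.util.Arrays;

/** Hypothetical sketch of threshold and percentile preprocessing; not the code from this PR. */
final class ThresholdPreprocessorSketch {

    /** Sets every similarity below the threshold to zero. */
    static double[] suppressBelow(double[] similarities, double threshold) {
        double[] result = similarities.clone();
        for (int i = 0; i < result.length; i++) {
            if (result[i] < threshold) {
                result[i] = 0;
            }
        }
        return result;
    }

    /** Derives the threshold from a percentile in [0, 1] (nearest-rank method), then applies it. */
    static double[] suppressBelowPercentile(double[] similarities, double percentile) {
        if (similarities.length == 0) {
            return new double[0];
        }
        double[] sorted = similarities.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(percentile * sorted.length);
        int index = Math.min(sorted.length - 1, Math.max(0, rank - 1));
        return suppressBelow(similarities, sorted[index]);
    }
}
```

Because the percentile adapts to the distribution of the similarities at hand, the same setting carries over between submission sets better than a fixed threshold.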

Options and CLI

The (many) options for clustering are all defined in a new ClusteringOptions class. This class also contains sane defaults that should allow users to run the clustering without specifying any additional CLI flags or defining them programmatically.
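The idea of an options object with built-in defaults can be illustrated like this. The sketch is not the ClusteringOptions class itself: the class, field, and method names are hypothetical, only a handful of options are shown, and the default values are taken from the help text further down.

```java
/** Hypothetical sketch of an immutable options object with defaults; not the ClusteringOptions class from this PR. */
final class ClusteringSettingsSketch {

    enum Algorithm { AGGLOMERATIVE, SPECTRAL }

    private final boolean enabled;
    private final Algorithm algorithm;
    private final int spectralMaxRuns;
    private final double agglomerativeThreshold;

    private ClusteringSettingsSketch(Builder builder) {
        this.enabled = builder.enabled;
        this.algorithm = builder.algorithm;
        this.spectralMaxRuns = builder.spectralMaxRuns;
        this.agglomerativeThreshold = builder.agglomerativeThreshold;
    }

    boolean isEnabled() { return enabled; }
    Algorithm getAlgorithm() { return algorithm; }
    int getSpectralMaxRuns() { return spectralMaxRuns; }
    double getAgglomerativeThreshold() { return agglomerativeThreshold; }

    /** Builder whose initial values mirror the defaults listed in the help text. */
    static final class Builder {
        private boolean enabled = true;
        private Algorithm algorithm = Algorithm.SPECTRAL;
        private int spectralMaxRuns = 50;
        private double agglomerativeThreshold = 0.2;

        Builder enabled(boolean value) { this.enabled = value; return this; }
        Builder algorithm(Algorithm value) { this.algorithm = value; return this; }
        Builder spectralMaxRuns(int value) { this.spectralMaxRuns = value; return this; }
        Builder agglomerativeThreshold(double value) { this.agglomerativeThreshold = value; return this; }

        ClusteringSettingsSketch build() { return new ClusteringSettingsSketch(this); }
    }
}
```

A caller who is happy with the defaults only needs `new ClusteringSettingsSketch.Builder().build()`, and a caller who wants to change a single parameter overrides just that one.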

I refactored the CommandLineArgument enum a little because I needed to add additional optional parameters, and having that many constructors became confusing. It now works in a builder-pattern-ish fashion. I also added a small class for dealing with groups of arguments (Clustering and Clustering - Preprocessing in the help text).

This is how the help message is now displayed:

usage: jplag [-h] [-l {java,python3,cpp,csharp,char,text,scheme}] [-bc BC] [-v {quiet,long}] [-d] [-S S] [-p P] [-x X] [-t T] [-m M] [-n N] [-r R] [-c {normal,parallel}]
             [--cluster-skip] [--cluster-alg {AGGLOMERATIVE,SPECTRAL}] [--cluster-metric {AVG,MIN,MAX,INTERSECTION}] [--cluster-spectral-bandwidth bandwidth]
             [--cluster-spectral-noise noise] [--cluster-spectral-min-runs min] [--cluster-spectral-max-runs max] [--cluster-spectral-kmeans-interations iterations]
             [--cluster-agglomerative-threshold threshold] [--cluster-agglomerative-inter-cluster-similarity {MIN,MAX,AVERAGE}] [--cluster-pp-none | --cluster-pp-cdf |
             --cluster-pp-percentile percentile | --cluster-pp-threshold threshold] rootDir [rootDir ...]

JPlag - Maintained by SDQ
Created by IPD Tichy, Guido Malpohl, and others. JPlag logo designed by Sandro Koch. Currently maintained by Sebastian Hahner and Timur Saglam.

positional arguments:
  rootDir                Root-directory that contains submissions

named arguments:
  -h, --help             show this help message and exit
  -l {java,python3,cpp,csharp,char,text,scheme}
                         Select the language to parse the submissions (default: java)
  -bc BC                 Path of the directory containing the base code (common framework used in all submissions)
  -v {quiet,long}        Verbosity of the logging (default: quiet)
  -d                     Debug parser. Non-parsable files will be stored (default: false)
  -S S                   Look in directories <root-dir>/*/<dir> for programs
  -p P                   comma-separated list of all filename suffixes that are included
  -x X                   All files named in this file will be ignored in the comparison (line-separated list)
  -t T                   Tunes the comparison sensitivity by adjusting  the  minimum  token  required  to  be  counted  as  a  matching  section. A smaller <n> increases the
                         sensitivity but might lead to more false-positives
  -m M                   Comparison similarity threshold [0-100]: All comparisons above this threshold will be saved (default: 0.0)
  -n N                   The maximum number of comparisons that will be shown in the generated report, if set to -1 all comparisons will be shown (default: 30)
  -r R                   Name of the directory in which the comparison results will be stored (default: result)
  -c {normal,parallel}   Comparison mode used to compare the programs (default: normal)

Clustering:
  --cluster-skip         Skips the clustering (default: false)
  --cluster-alg {AGGLOMERATIVE,SPECTRAL}
                         Which clustering algorithm to use. Agglomerative merges similar  submissions  bottom  up. Spectral clustering is combined with Bayesian Optimization
                         to execute the k-Means clustering algorithm multiple times, hopefully finding a "good" clustering automatically. (default: SPECTRAL)
  --cluster-metric {AVG,MIN,MAX,INTERSECTION}
                         The metric used for clustering. AVG is intersection over union, MAX can expose some attempts of obfuscation. (default: MAX)
  --cluster-spectral-bandwidth bandwidth
                         Bandwidth of the matern kernel in the Gaussian Process used  during  the  search  for  a  good number of clusters for spectral clustering. If a good
                         clustering result is found during the search, numbers of clusters  that  differ  by  something  in range of the bandwidth are also expected to good.
                         (default: 20.0)
  --cluster-spectral-noise noise
                         The result of each run in the search for good clusterings are random.  The  noise level models the variance in the "worth" of these results. It also
                         acts as a regularization constant. (default: 0.0025000002)
  --cluster-spectral-min-runs min
                         Minimum number of k-Means executions during spectral clustering. With these initial clustering sizes are explored. (default: 5)
  --cluster-spectral-max-runs max
                         Maximum number of k-Means executions during spectral  clustering.  Any  execution  after  the  initial  runs tries to balance between exploration of
                         unknown clustering sizes and exploitation of clustering sizes known as good. (default: 50)
  --cluster-spectral-kmeans-interations iterations
                         Maximum number of iterations during each execution of the k-Means algorithm. (default: 200)
  --cluster-agglomerative-threshold threshold
                         Only clusters with an inter-cluster-similarity greater than this threshold are merged during agglomerative clustering. (default: 0.2)
  --cluster-agglomerative-inter-cluster-similarity {MIN,MAX,AVERAGE}
                         How to measure the similarity of two clusters during  agglomerative  clustering.  Minimum,  maximum or average similarity between the submissions in
                         each cluster. (default: AVERAGE)

Clustering - Preprocessing:
  --cluster-pp-none      Do not use any preprocessing before clustering. Not recommended for spectral clustering. (default: false)
  --cluster-pp-cdf       Before clustering, the value of the cumulative distribution function  of  all  similarities is estimated. The similarities are multiplied with these
                         estimates. This has the effect of supressing similarities that are low compared to other similarities. (default: false)
  --cluster-pp-percentile percentile
                         Any similarity smaller than the given percentile will be suppressed during clustering.
  --cluster-pp-threshold threshold
                         Any similarity smaller than the given threshold value will be suppressed during clustering.

Technical Stuff

This adds two new Maven dependencies: commons-math (k-Means++, vectors, matrices, and many small algorithms used here and there) and mockito (testing).
I tried to minimize coupling between the clustering code and other code:
- The main package de.jplag is only coupled to de.jplag.clustering through the ClusteringOptions and ClusteringFactory classes
- The clustering package de.jplag.clustering is only coupled to de.jplag through the JPlagComparison and Submission classes
- The clustering code only uses JPlagComparison and Submission in the ClusteringFactory and ClusteringAdapter classes, the latter replacing submissions with integer indices and comparisons with matrices.
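The last point, replacing submissions with integer indices and comparisons with matrices, can be illustrated with a generic sketch. It does not use the real JPlagComparison or Submission classes; the type parameter stands in for a submission, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of an adapter that hides domain objects behind indices; not the ClusteringAdapter from this PR. */
final class AdapterSketch<T> {

    /** A pairwise comparison reduced to what the clustering needs. */
    static final class Comparison<S> {
        final S first;
        final S second;
        final double similarity;

        Comparison(S first, S second, double similarity) {
            this.first = first;
            this.second = second;
            this.similarity = similarity;
        }
    }

    private final List<T> items = new ArrayList<>();
    private final Map<T, Integer> indices = new HashMap<>();

    /** Returns the index of an item, assigning the next free index on first use. */
    int indexOf(T item) {
        return indices.computeIfAbsent(item, key -> {
            items.add(key);
            return items.size() - 1;
        });
    }

    /** Maps an index from a clustering result back to the original item. */
    T itemAt(int index) {
        return items.get(index);
    }

    /** Builds a symmetric similarity matrix; pairs without a comparison stay at zero. */
    double[][] toMatrix(List<Comparison<T>> comparisons) {
        for (Comparison<T> comparison : comparisons) {
            indexOf(comparison.first);
            indexOf(comparison.second);
        }
        double[][] matrix = new double[items.size()][items.size()];
        for (Comparison<T> comparison : comparisons) {
            int i = indexOf(comparison.first);
            int j = indexOf(comparison.second);
            matrix[i][j] = comparison.similarity;
            matrix[j][i] = comparison.similarity;
        }
        return matrix;
    }
}
```

With this split, the algorithms only ever see a matrix and integer indices, so they can be tested without constructing any submissions.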

SimDing · 2022-01-28T18:31:48Z

I'm not quite finished yet, but I have a question and would like you to see my current state:

Currently I've put every setting about the clustering plainly inside the JPlagOptions class. This feels pretty messy. Do you have a suggestion?

tsaglam · 2022-01-31T08:06:53Z

> Currently I've put every setting about the clustering plainly inside the JPlagOptions class. This feels pretty messy. Do you have a suggestion?

That is a good question. Currently, I count 14 additional options you added in this PR. The question here would be how many of those the user really modifies. If some are parameters that might be tweaked in the future, but most users will not change the default values, then we should not expose these parameters as options. Even from a usability perspective, modifying 14 options for clustering alone via the CLI seems excessive and very unlikely (think about the flags you would need to define).
Thus, we can maybe reduce the number of options exposed to the user by keeping some clustering parameters as internal parameters that may be tweaked by devs but not by users (another option: they may be tweaked when using JPlag programmatically but not via the CLI).

From a technical standpoint, you could encapsulate the clustering parameters in a data object, but that does not solve the problem of setting these options via the CLI.

SimDing · 2022-02-10T10:43:46Z

> The question here would be how many of those the user really modifies

I think the defaults are reasonably sane, so I hope most users would not have to change much.

> Even from a usability context, modifying 14 options for clustering alone via the CLI seems excessive

Users would not use all options at the same time.

I see two cases in which a user would want to change the options:

Clustering takes too long?
- Disable clustering
- Use a threshold preprocessor
- (spectral) Use fewer k-Means iterations
- (spectral) Use fewer runs

The result is not good? The problem is that there is not really much users can do then except fiddle with the parameters that change the result of the clustering. I don't know of any of those parameters that would be best kept from users. In that case, at most five parameters would be set (preprocessor, preprocessor option, noise, kernel bandwidth, and similarity metric).

A user who had both problems would use 7 options at most.

The only thing I can really remove without a bad aftertaste is the option about pruning bad clusters. There does not seem to be a practical reason to look at those.

jplag/src/main/java/de/jplag/clustering/ClusteringFactory.java

jplag/src/main/java/de/jplag/clustering/algorithm/AgglomerativeClustering.java

jplag/src/main/java/de/jplag/clustering/Preprocessing.java

dfuchss

Minor comments :)

jplag/src/main/java/de/jplag/clustering/ClusteringFactory.java

dfuchss · 2022-02-21T22:07:48Z

jplag/src/main/java/de/jplag/clustering/algorithm/InterClusterSimilarity.java

+                float submissionSimilarity = (float) similarityMatrix.getEntry(leftSubmission, rightCluster.get(rightIndex));
+                similarity = (float) this.accumulator.applyAsDouble(similarity, submissionSimilarity);


Instead of casting, simply use BiFunction<Double, Double, Float> or replace float with double.
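A sketch of the second suggestion (using double throughout so that no casts are needed) could look like the following. It is an illustration under assumptions, not the InterClusterSimilarity enum from this PR: it takes a plain double[][] instead of the matrix type used in the diff, expects non-empty clusters, and all names are hypothetical.

```java
import java.util.List;
import java.util.function.DoubleBinaryOperator;

/** Hypothetical cast-free variant using double everywhere; not the InterClusterSimilarity enum from this PR. */
enum InterClusterSimilaritySketch {
    MIN(Math::min, Double.POSITIVE_INFINITY, false),
    MAX(Math::max, Double.NEGATIVE_INFINITY, false),
    AVERAGE(Double::sum, 0.0, true);

    private final DoubleBinaryOperator accumulator;
    private final double initialValue;
    private final boolean divideByPairCount;

    InterClusterSimilaritySketch(DoubleBinaryOperator accumulator, double initialValue, boolean divideByPairCount) {
        this.accumulator = accumulator;
        this.initialValue = initialValue;
        this.divideByPairCount = divideByPairCount;
    }

    /** Accumulates the similarities of all cross-cluster pairs; both clusters are assumed to be non-empty. */
    double between(double[][] similarityMatrix, List<Integer> leftCluster, List<Integer> rightCluster) {
        double similarity = initialValue;
        for (int left : leftCluster) {
            for (int right : rightCluster) {
                similarity = accumulator.applyAsDouble(similarity, similarityMatrix[left][right]);
            }
        }
        return divideByPairCount ? similarity / (leftCluster.size() * rightCluster.size()) : similarity;
    }
}
```

Since DoubleBinaryOperator already works on primitive doubles, keeping the accumulated value as a double avoids both the casts and any boxing.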

SimDing added 5 commits January 28, 2022 17:07

Clustering with pseudonymized reports

e8f91f8

Merge branch 'master' into readd-clustering

aa426da

Fix merge

42e4959

Make bgfs not fail when the cauchy function can't be initialized properly

d649b64

Fix: Clustering fails when there are no clusters

21449f7

SimDing force-pushed the readd-clustering branch from dfa04bf to 21449f7 Compare January 28, 2022 18:35

SimDing added 4 commits January 28, 2022 19:41

Undo change by java language server

72b9004

Fix javadoc error

ecda92e

Fix: Do not use java 16 features

82a183a

Complain about missing submodule

796620e

tsaglam added enhancement Issue/PR that involves features, improvements and other changes major Major issue/feature/contribution/change PISE-WS21/22 Tasks for the PISE practical course (WS21/22) labels Jan 31, 2022

tsaglam added this to the v3.1.0 milestone Jan 31, 2022

tsaglam linked an issue Feb 9, 2022 that may be closed by this pull request

Readd clustering and min/max/avg scores #116

Closed

3 tasks

Remove unnecessary clustering option

75d7ecd

tsaglam self-assigned this Feb 11, 2022

SimDing added 10 commits February 11, 2022 09:35

Move clustering options to dedicated class

32a1285

Add clustering options to CLI

c26a26f

Clustering tests

7e8472d

Merge branch 'master' into readd-clustering

d76a1ff

Rename top down to agglomerative clustering

b054791

Remove unused class

f7aef77

Apply spotless

f3fb600

Add some more javadoc

0988fec

Spotless

82d1353

Mark clustering options as final

66b6921

SimDing added 2 commits February 18, 2022 17:03

Renameing + JDoc in GP

bc8f8d3

Remove unused methods

5ac7248

dfuchss requested changes Feb 20, 2022

SimDing added 13 commits February 20, 2022 19:43

Move preprocessor creation into enum

93110cc

Move cluster similarity calculation into enum

5004fd0

Renaming stuff in BaysianOptimization

82037bc

Renaming in gaussian process

bf8bdc7

Renaming in percentile preprocessor

8496fbb

Clean up ClusteringTest

a8f4922

Spotless

f8ccf50

Rename params in baysian optimization

4ae5ed7

Extract test data from AgglomerativeClusteringTest

10ec0e5

Test spectral clustering with same data as agglomerative clustering

4f9cb03

Tests for threshold preprocessor

b28b7a0

Test for percentile preprocessor

ce66be5

Test for cdf preprocessor

d5e48a2

dfuchss requested changes Feb 21, 2022

SimDing added 2 commits February 24, 2022 11:30

Test for cluster

4ada621

Test for clustering result

5797920

SimDing force-pushed the readd-clustering branch from a5bb2a3 to 5797920 Compare February 24, 2022 10:47

tsaglam approved these changes Feb 24, 2022

Remove traces of git submodule.

84f4dbd

Merge pull request #5 from jplag/readd-clustering-submodule-fix

6d82ee7

Remove traces of git submodule

tsaglam merged commit bffc5c9 into jplag:master Feb 24, 2022

tsaglam mentioned this pull request Feb 24, 2022

Readd clustering and min/max/avg scores #116

Closed

3 tasks

SimDing mentioned this pull request Mar 11, 2022

Contributing to JPlag SimDing/JPlag#8

Open

sebinside mentioned this pull request Mar 15, 2022

Update documentation #317

Closed

dfuchss mentioned this pull request Mar 25, 2022

Feature: JPlag 4.0 ls1intum/Artemis#4861

Closed

sebinside mentioned this pull request Apr 11, 2022

Enhance the new Report Viewer #357

Closed

30 tasks
